Recent research on fine-tuning large language models (LLMs) by aggregating multiple preferences has attracted considerable attention. However, the existing literature predominantly focuses on the empirical performance of aggregation algorithms, neglecting agents' underlying incentives to misreport their preferences. In this paper, we formalize this setting as a multi-parameter mechanism design problem, in which an LLM provider designs both training and payment rules to achieve specific objectives and to promote truthful reporting of preferences. First, we establish the necessity of a payment scheme by demonstrating that without payments, truth-telling is a strictly dominated strategy under a wide range of training rules. We then introduce the affine maximizer payment scheme for the social-welfare-maximizing training rules that are widely used in practice, which ensures both dominant-strategy incentive compatibility (DSIC) and individual rationality (IR). Furthermore, we prove that under mild conditions, any other payment rule that also implements these training rules in DSIC can be converted to the affine maximizer payment by adding a term independent of the agent's own report. We also show that this mechanism satisfies approximate DSIC when the input of the mechanism is a biased version of the reported preferences, showcasing its robustness in real-world applications.
The pre-training and fine-tuning paradigm is fundamental to developing language models (Devlin et al., 2018; Radford et al., 2018; Liu et al., 2019; Touvron et al., 2023). During pre-training, the model is fed vast amounts of data to acquire a general capability to understand and generate language through self-supervised learning. The subsequent fine-tuning phase customizes these pre-trained models for specific downstream tasks using smaller, task-oriented datasets, ensuring that the model outputs are more closely aligned with particular requirements. As LLMs gain increasing popularity, there is a growing demand for fine-tuning base LLMs, as base models often fail to meet users' demands, especially in catering to individual preferences.
The process of fine-tuning an LLM to align with certain human preferences is challenging to achieve through supervision (Ji et al., 2023; Köpf et al., 2024; Wang et al., 2023b; Shen et al., 2023), primarily due to the difficulty of constructing datasets with a substantial number of valid question-answer pairs for supervised training. Reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022; Christiano et al., 2017) offers a promising solution to this problem. In RLHF, a reward model is first trained to serve as a proxy for human judgment. This model then provides reward signals for the standard reinforcement learning process. This technique of fine-tuning with a reward model has proven effective in encoding human preferences into models and has become a fundamental component of the training process for most advanced LLMs. With the advancement of RLHF, numerous studies have investigated efficient methods for aggregating multiple preferences into a single fine-tuned model.
However, most of these studies focus primarily on improving empirical performance across various metrics (Ramé et al., 2024; Wu et al., 2024; Jang et al., 2023; Coste et al., 2023; Zhang et al., 2024; Wang et al., 2024; Eisenstein et al., 2023). They often implicitly assume that true preferences are directly accessible, neglecting the possibility of agents misreporting their preferences. This problem becomes more crucial in a real-world scenario where different agents provide their preferences for aggregation. In such cases, agents may engage in strategic misreporting to increase their utility. An intuitive example: if an agent knows beforehand that the fine-tuning process aims to neutralize all preferences, it might pretend to hold a more polarized preference as a beneficial strategy. These strategic behaviors can distort the final training results, even if the training algorithm is highly effective. Nevertheless, this issue has not attracted sufficient attention in the existing literature, particularly concerning the fine-tuning process of LLMs.
In this paper, we mainly study the incentive design in such scenarios. First, we formalize this as a multi-parameter mechanism design problem between a fine-tuning service provider and groups of agents seeking fine-tuning services. The provider proposes a mechanism that includes a training rule for integrating different groups' preferences into a fine-tuned model and a payment rule to charge the groups. After observing the mechanism, each group strategically reports its preference to maximize its utility. We assume that the subsequent fine-tuning process is implemented using RLHF, a standard method for aligning a model with human preferences. Therefore, we abstract each group's preference as a reward model, and term the whole scenario the RLHF Game.
Second, we demonstrate the profitability of misreporting a polarized preference under a wide range of mechanisms that include only a training rule (Theorem 3.3). This underscores the necessity of a payment rule to address incentive issues.
Third, we focus on a representative set of training rules, termed the SW-Maximizing training rules, in which the provider aims to maximize social welfare while incorporating different regularization measures. For SW-Maximizing training rules, we propose the affine maximizer payment scheme, a weighted version of the Vickrey-Clarke-Groves (VCG) payment (Vickrey, 1961; Clarke, 1971; Groves, 1973). We prove that truthfully reporting preferences constitutes a dominant strategy for agents in such mechanisms (Theorem 4.2). Utilizing the notion of payment equivalence, we prove that under a mild condition, any other payment rule that also implements these training rules in dominant-strategy incentive compatibility (DSIC) can be converted to the affine maximizer payment by adding a term independent of the agents' own reports (Theorem 4.5). We validate this condition for many commonly used regularization terms, such as KL-divergence (Proposition 4.4). Consequently, we derive the revenue-maximizing payment rule that implements SW-Maximizing training rules in both DSIC and individual rationality (IR) (Corollary 4.6). Furthermore, we show that this mechanism remains approximately DSIC when the input of the mechanism is a biased version of the reported preferences, an abstraction of the inevitable errors that occur in practice. This showcases the robustness of the proposed mechanisms in real-world applications (Theorem 4.9).
Several studies have investigated similar scenarios. Among them, Duetting et al. (2023) and Soumalias et al. (2024) are most related to ours. Duetting et al. (2023) examines the problem of designing a mechanism that aggregates multiple agents' preferences based on their bids and determines their payments. However, they exclude the case where preferences can be misreported, which is the primary concern of our study. The concurrent work by Soumalias et al. (2024) also considers mechanism design for aggregating multiple preferences. Their focus is mainly on the practical implementation of the SW-Maximizing training rule with KL-divergence and a payment scheme that achieves both DSIC and interpretability. In contrast, we are more concerned with the theoretical properties of more general mechanisms, including implementability and payment equivalence.
Additionally, there are works studying other LLM-related scenarios from the perspective of algorithmic game theory. Laufer et al. (2023) abstracts the fine-tuning process as a bargaining game and characterizes the subgame-perfect equilibria. Dubey et al. (2024) proposes an auction where bidders compete to place their content within a summary generated by an LLM. Conitzer et al. (2024) considers incorporating social choice theory into LLM alignment. Feizi et al. (2023) explores the potential of leveraging LLMs in online advertising systems.
In Section 2, we provide the preliminaries and the formal description of the RLHF Game. In Section 3, we study the incentive design for general training rules in the RLHF Game. We demonstrate the properties of mechanisms that consist of SW-Maximizing training rules and payment rules in Section 4. Further related work is provided in Section 5, and we conclude in Section 6.
Large language models (LLMs) function as mappings from a sequence of tokens to a probability distribution over the next token. The input sequence is usually constrained by a maximum length $T$, thereby making the set of all possible inputs finite. Let $\mathcal{V}$ denote the finite set of all tokens, and let $\mathcal{S}$ represent the set of all possible input sequences with lengths up to $T$.
An LLM parameterized by $\theta$ is denoted as $\pi_\theta: \mathcal{S} \to \Delta(\mathcal{V})$, where $\Delta(\mathcal{V})$ is the set of all probability distributions over the token set $\mathcal{V}$. For practical purposes, the output sequence is also required to be of finite length. We assume the maximum output length is also $T$, so that the output space is also $\mathcal{S}$. We denote by $\pi_\theta(s)$ the probability that a sequence of tokens $s \in \mathcal{S}$ is generated by $\pi_\theta$. Since the model generates a sequence by predicting the next token iteratively until a special ending token is encountered, the relationship between $\pi_\theta(s)$ and the next-token distributions is given by:
$$\pi_\theta(s) = \prod_{t} \pi_\theta(s_t \mid s_{<t}),$$ where $s_{<t}$ denotes the prefix subsequence of $s$ preceding position $t$, and the product runs over all positions including the ending token. $\pi_\theta(\cdot)$ is a distribution over $\mathcal{S}$ and can be represented as an $|\mathcal{S}|$-dimensional vector, with each coordinate the probability of the corresponding sequence being generated under $\pi_\theta$.
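This chain-rule factorization can be sketched concretely. The snippet below is a toy illustration only: `next_token_dist` is a hypothetical stand-in for an LLM's next-token head, not any particular model API.

```python
import math

def sequence_log_prob(next_token_dist, seq, eos="<eos>"):
    """Log-probability of generating `seq` (then stopping) under an
    autoregressive model: the product over positions of the conditional
    next-token probabilities, accumulated in log space."""
    logp, prefix = 0.0, []
    for tok in list(seq) + [eos]:
        dist = next_token_dist(tuple(prefix))  # distribution over the next token
        logp += math.log(dist[tok])
        prefix.append(tok)
    return logp

# Toy model: regardless of the prefix, emit "a" or stop with equal probability.
toy = lambda prefix: {"a": 0.5, "<eos>": 0.5}
prob = math.exp(sequence_log_prob(toy, ["a", "a"]))  # 0.5 * 0.5 * 0.5
print(round(prob, 4))  # → 0.125
```

The finite maximum length is what makes the sequence space, and hence the probability vector over it, finite-dimensional.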
Reward modeling is instrumental for aligning LLMs with human preferences, particularly within the context of RLHF. In this process, a reward model is first trained on a human-annotated preference dataset using the Bradley-Terry model (Bradley and Terry, 1952). Essentially, a reward model $rm$ is a function that maps a sequence of tokens to a real number indicating the preference for that sequence. Similar to $\pi_\theta$, $rm$ can also be considered an $|\mathcal{S}|$-dimensional vector. Following prior empirical work on RLHF (Rafailov et al., 2023), we consider normalized reward models whose values sum to one, i.e., $\sum_{s \in \mathcal{S}} rm(s) = 1$. Furthermore, we also assume that the output rewards are all non-negative, i.e., $rm(s) \ge 0$ for all $s \in \mathcal{S}$. Unless otherwise stated, we use $\mathcal{R}$ to denote the domain of all reward model functions that satisfy the above conditions. In fact, the results in our paper also apply to some other normalization methods.
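As a concrete illustration of this normalization, the helper below maps raw scalar scores to a non-negative vector summing to one. Shifting by the minimum score is our own choice for enforcing non-negativity, not a step prescribed by the paper.

```python
def normalize_rm(raw):
    """Turn raw reward scores (dict: sequence -> float) into a normalized
    reward model: non-negative entries that sum to 1."""
    lo = min(raw.values())
    shifted = {s: v - lo for s, v in raw.items()}   # enforce non-negativity
    total = sum(shifted.values())
    if total == 0:                                   # all scores equal: uniform
        return {s: 1.0 / len(raw) for s in raw}
    return {s: v / total for s, v in shifted.items()}

rm = normalize_rm({"helpful": 2.0, "neutral": 1.0, "harmful": -1.0})
print(rm["helpful"], round(sum(rm.values()), 10))  # → 0.6 1.0
```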
In this part, we present the formal description of the RLHF Game. There is one LLM provider and $n$ groups of agents, denoted by $[n] = \{1, \dots, n\}$. The provider has an initial model $\theta_{\text{init}}$ with positive probability on all sequences, i.e., $\pi_{\theta_{\text{init}}}(s) > 0$ for all $s \in \mathcal{S}$. Each group $i$ has $n_i$ agents and a joint preference represented by a reward model $r_i$. Let $\mathcal{R}$ and $\mathcal{N}$ denote the domains of each group's reward model and group size, respectively. We assume an upper bound $N$ for group sizes. The exact reward model $r_i$ and the size $n_i$ are group $i$'s private information. For an agent in group $i$, the valuation when it receives a model $\theta$ is denoted by $v(r_i, \theta)$. The form of the valuation function is known to both the provider and the agents.
The provider first announces the mechanism, including a training rule $f$ and a payment rule $p = (p_1, \dots, p_n)$.
Both rules take the reported reward models, reported sizes, and an initial model as input, and output the objective fine-tuned model and each group's payment, respectively. The provider can choose not to charge the groups by setting $p$ identically to zero. In this case, the model coincides with most previous work, where agents' incentives are not considered (Ramé et al., 2024; Wu et al., 2024; Jang et al., 2023; Coste et al., 2023; Zhang et al., 2024; Wang et al., 2024; Eisenstein et al., 2023). Specifically, the training rule seeks the model that maximizes a certain objective function $F$. That is,
$$f(\hat r, \hat n; \theta_{\text{init}}) \in \operatorname*{arg\,max}_{\theta} F(\theta; \hat r, \hat n, \theta_{\text{init}}),$$ where the objective $F$ incorporates a measure $D(\pi_\theta, \pi_{\theta_{\text{init}}})$ of the distance between $\pi_\theta$ and $\pi_{\theta_{\text{init}}}$. We assume that the function $F$ has a unique global optimum for any possible input. Hence, in the rest of the paper, the "$\in$" in the definition of $f$ is replaced by "$=$".
After observing the announced mechanism $(f, p)$, each group $i$ reports a reward model $\hat r_i$ and its group size $\hat n_i$. We assume all reported sizes are in $\mathcal{N}$ and therefore bounded by $N$. Based on the reported information, the provider fine-tunes the model until it is optimal, i.e., the final parameter $\theta^*$ satisfies $\theta^* = f(\hat r, \hat n; \theta_{\text{init}})$. The provider then charges group $i$ according to the payment rule, $p_i(\hat r, \hat n; \theta_{\text{init}})$. All members of the group have access to the fine-tuned model $\theta^*$, so the valuation for group $i$ is $n_i \cdot v(r_i, \theta^*)$. We assume all groups have quasi-linear utilities. Therefore, group $i$'s utility is
$$u_i = n_i \cdot v(r_i, \theta^*) - p_i(\hat r, \hat n; \theta_{\text{init}}).$$
The groups may report strategically, so $\hat r_i$ and $\hat n_i$ do not necessarily equal the true $r_i$ and $n_i$. The goal of the LLM provider is to achieve its training objective based on the groups' true preferences, taking into account that misreporting may distort the training outcome. To this end, it is crucial to incentivize all groups to report their information truthfully, so that the provider has access to the groups' private information. We formally define these desiderata of a mechanism as follows.
A mechanism $(f, p)$ satisfies dominant-strategy incentive compatibility (DSIC) if for all $i \in [n]$, all true types $(r_i, n_i)$, all misreports $(\hat r_i, \hat n_i)$, all others' reports $(\hat r_{-i}, \hat n_{-i})$, and all initial models $\theta_{\text{init}}$, we have
$$u_i\big((r_i, n_i), (\hat r_{-i}, \hat n_{-i})\big) \;\ge\; u_i\big((\hat r_i, \hat n_i), (\hat r_{-i}, \hat n_{-i})\big). \tag{DSIC}$$
A mechanism $(f, p)$ satisfies individual rationality (IR) if for all $i \in [n]$, all true types $(r_i, n_i)$, all others' reports $(\hat r_{-i}, \hat n_{-i})$, and all initial models $\theta_{\text{init}}$, we have
$$u_i\big((r_i, n_i), (\hat r_{-i}, \hat n_{-i})\big) \;\ge\; 0. \tag{IR}$$
DSIC means that for any group, truthfully reporting the reward model and the group size yields the highest utility, regardless of other groups' reports. IR means that truthful reporting always yields non-negative utility. Only when both DSIC and IR are satisfied are all groups incentivized to participate in this game and report truthfully. When a mechanism $(f, p)$ satisfies DSIC, IR, or both, we say that the payment rule $p$ implements $f$ in DSIC, IR, or both DSIC and IR, respectively. In particular, when we speak of the implementability of a training rule, we refer to the property of DSIC.
In this section, we discuss the incentive design within the RLHF Game framework. As a warm-up, we consider a simplified scenario in which all group sizes equal one and this information is public to all groups and the provider. Consequently, each group is required only to report its reward model, and for convenience we omit the notation of group sizes. Unless stated otherwise, the results directly apply to the more general case where the group size is also private information.
For the valuation function in this section, we consider a reasonable form defined as follows.
For any agent with preference represented by reward model $rm$, its valuation for a model $\theta$ is its expected reward over the sequences generated by $\pi_\theta$:
$$v(rm, \theta) = \mathbb{E}_{s \sim \pi_\theta}[rm(s)] = \langle rm, \pi_\theta \rangle.$$
In practice, this can be obtained by averaging the reward of the sequences sampled from an LLM. We discuss the influence of possible errors in this process in Section 4 .
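A minimal sketch of that sampling-based estimate, with a toy model given as an explicit distribution (the names and the sampler are our own illustration):

```python
import random

def estimate_valuation(sample_seq, rm, num_samples=20_000, seed=0):
    """Monte-Carlo estimate of v(rm, theta) = E_{s ~ pi_theta}[rm(s)]:
    average the reward of sequences sampled from the model."""
    rng = random.Random(seed)
    return sum(rm[sample_seq(rng)] for _ in range(num_samples)) / num_samples

# Toy model over three sequences with a known distribution.
pi = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
rm = {"s1": 0.6, "s2": 0.3, "s3": 0.1}
draw = lambda rng: rng.choices(list(pi), weights=list(pi.values()))[0]

exact = sum(pi[s] * rm[s] for s in pi)   # the true inner product <rm, pi> = 0.41
approx = estimate_valuation(draw, rm)
print(abs(approx - exact) < 0.01)
```

The sampling error here is exactly the kind of deviation whose effect on incentives is analyzed in Section 4.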
We begin by demonstrating the necessity of payment rules to ensure incentive compatibility for training rules under the following assumptions.
(1) The objective $F$ is differentiable: its partial derivatives with respect to the reward values and with respect to the model's output probabilities exist. (2) The distance measure $D$ satisfies that its derivative exists and is positive, assigning an increasingly large penalty as $\pi_\theta$ moves farther from $\pi_{\theta_{\text{init}}}$. (3) For all reward model profiles and initial models, the fine-tuned model $\theta^*$ satisfies $\pi_{\theta^*}(s) > 0$ for all $s \in \mathcal{S}$.
The rationale behind these assumptions is as follows: (1) reflects that the training process aims to find a model that not only yields higher valuations for all agents but also remains close to the initial model $\theta_{\text{init}}$. (2) is a convexity-type condition assigning an increasingly large penalty to $\pi_\theta$ as it moves farther from $\pi_{\theta_{\text{init}}}$. (3) excludes extreme training rules whose outcome remains the same for most inputs and then changes drastically. In practice, (1) is satisfied by most training objectives $F$, including those maximizing social welfare and Nash social welfare. (2) and (3) depend on the choice of the regularization measure and the strength of regularization; at the least, they are satisfied by the commonly used KL-divergence.
Combining these three conditions, we show that when the reported preference for some sequence increases while the others remain unchanged, the probability of that sequence under the optimal model also increases. In this case, an intuitive manipulation is for an agent to report a polarized reward model: a higher reward value for the sequence it values most. We show that this strategy yields strictly higher utility unless the agent is indifferent among outcomes in some subset and does not care about the outcomes outside it at all.
Here, we call a strategy strongly dominated if another strategy yields strictly higher utility regardless of others' reports. Theorem 3.3 tells us that with only training rules, truthful reporting is strongly dominated and thus will not be adopted by rational agents.
Having established the necessity of payment rules in this scenario, we mainly address two questions in the remainder of this section. First, given a training rule $f$, can we find a payment rule $p$ such that the mechanism $(f, p)$ satisfies DSIC? This is the so-called implementability of a training rule $f$. Second, for an implementable training rule $f$, can we identify the relationship between the payment rules $p$ among all DSIC mechanisms $(f, p)$?
We resolve the first question primarily by utilizing the notion of cycle monotonicity, first proposed by Rochet (1987). Cycle monotonicity generalizes the monotonicity defined in single-parameter scenarios (Myerson, 1981). In the RLHF Game, we define a function $\ell$ that measures the valuation gain from truthful reporting over misreporting: with the others' reports $\hat r_{-i}$ and the initial model fixed, $\ell(r_i, r_i')$ compares the valuation of an agent with true reward model $r_i'$ under the outcomes $f(r_i', \hat r_{-i})$ and $f(r_i, \hat r_{-i})$. Cycle monotonicity is defined based on this function:
The training rule $f$ satisfies cycle monotonicity if for any agent $i$, any finite and distinct sequence of reward models $r_i^1, r_i^2, \dots, r_i^K$ with $r_i^{K+1} = r_i^1$, and any reports $\hat r_{-i}$ of the others, we have
$$\sum_{k=1}^{K} \ell(r_i^k, r_i^{k+1}) \;\ge\; 0.$$
For general training rules, cycle monotonicity is a sufficient and necessary condition for implementability.
A training rule is implementable if and only if it satisfies cycle monotonicity.
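Cycle monotonicity can be checked numerically on a small, discretized type space. The sketch below uses one standard sign convention for $\ell$ (the gain from truthful reporting over misreporting) and a single-agent SW-maximizing rule with KL regularization, whose optimizer has the closed form $\pi(s) \propto \pi_0(s)\,e^{r(s)/\lambda}$; both choices are ours for illustration.

```python
import math
from itertools import permutations

def outcome(r, pi0=(0.5, 0.5), lam=1.0):
    """Optimizer of <r, pi> - lam * KL(pi || pi0): pi(s) ∝ pi0(s) e^{r(s)/lam}."""
    w = [p * math.exp(x / lam) for p, x in zip(pi0, r)]
    z = sum(w)
    return tuple(x / z for x in w)

def value(r, pi):
    return sum(a * b for a, b in zip(r, pi))

def ell(a, b):
    """Valuation gain for true type b from reporting b rather than a."""
    return value(b, outcome(b)) - value(b, outcome(a))

def cycle_monotone(types, max_len=3, tol=1e-12):
    """Every directed cycle of length <= max_len has non-negative ell-length."""
    for k in range(2, max_len + 1):
        for cyc in permutations(types, k):
            total = sum(ell(a, b) for a, b in zip(cyc, cyc[1:] + cyc[:1]))
            if total < -tol:
                return False
    return True

types = [(1.0, 0.0), (0.7, 0.3), (0.5, 0.5), (0.0, 1.0)]
print(cycle_monotone(types))  # → True
```

The check passes here because this rule maximizes the agent's valuation plus a report-independent term, so the cycle sums telescope to a non-negative quantity; that is the mechanism-design reason such rules are implementable.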
In fact, the proof of Theorem 3.5 is constructive. However, for general implementable training rules, the calculation of the payment rules is too complex to be practical.
The second question is more general, so we primarily consider the concept of payment equivalence (Ashlagi et al., 2010) for an implementable training rule.
An implementable training rule $f$ satisfies payment equivalence if for any two mechanisms $(f, p)$ and $(f, p')$ satisfying DSIC, there exists a function $h_i$ independent of group $i$'s report such that
$$p_i'(\hat r_i, \hat r_{-i}; \theta_{\text{init}}) = p_i(\hat r_i, \hat r_{-i}; \theta_{\text{init}}) + h_i(\hat r_{-i}; \theta_{\text{init}}).$$
Equivalently, when fixing $\hat r_{-i}$ and $\theta_{\text{init}}$, there exists a constant $c$ such that $p_i'(\hat r_i, \hat r_{-i}; \theta_{\text{init}}) = p_i(\hat r_i, \hat r_{-i}; \theta_{\text{init}}) + c$ for all $\hat r_i$.
Payment equivalence indicates that the only way to modify a DSIC mechanism $(f, p)$ into another DSIC mechanism $(f, p')$ is to add to each payment function $p_i$ a term independent of group $i$'s report. Thus, the payment equivalence of $f$ is sometimes interpreted as the uniqueness of the payment rule that implements it in DSIC. This notion is strong and useful: once a training rule satisfies payment equivalence and we find one mechanism satisfying DSIC, all payment rules that implement $f$ in DSIC are characterized. In particular, it becomes possible to find the revenue-maximizing payment rule among all payment rules that implement $f$ in both DSIC and IR.
Payment equivalence is influenced by the domain of the types: reward models and group sizes in the RLHF Game. When group sizes are public, the agents only report their reward models, whose domain $\mathcal{R}$ contains all normalized reward models. Therefore, the domain of each group's whole private information is exactly $\mathcal{R}$, a connected set in Euclidean space. Thus, we can directly apply the result in Nisan et al. (2007) and obtain the following theorem.
When the group sizes are public information and the agents only report the reward models, all implementable training rules satisfy payment equivalence.
However, when the group sizes are also part of each group's private information, the domain of the whole private information becomes $\mathcal{R} \times \mathcal{N}$, which is no longer a connected set because the set of possible group sizes $\mathcal{N}$ is discrete. Thus, payment equivalence may not be satisfied for general training rules, and we study it for a representative set of training rules in the following section.
In this section, we consider the scenario where each group must simultaneously report its reward model and its size. Our objective is to design a mechanism that incentivizes each group to truthfully report both. For general training rules, though it is possible to adopt the method used in the constructive proof of Theorem 3.5 to derive a payment rule, the resulting payment rule can be complex and impractical.
Therefore, in this section, our primary focus is on a subset of training rules designed to maximize social welfare under regularization constraints, which is commonly used in practice to aggregate various preferences (Boyd and Vandenberghe, 2004; Nocedal and Wright, 1999), balancing efficiency and fairness.
Given the reports $\hat r = (\hat r_1, \dots, \hat r_n)$, $\hat n = (\hat n_1, \dots, \hat n_n)$, and the initial model $\theta_{\text{init}}$, an SW-Maximizing training rule fine-tunes the model to maximize social welfare under a regularization penalty measured by some metric $D$. Formally, it is represented as:
$$f^{SW}(\hat r, \hat n; \theta_{\text{init}}) = \operatorname*{arg\,max}_{\theta} \;\sum_{i=1}^{n} \hat n_i \cdot \langle \hat r_i, \pi_\theta \rangle - \lambda \, D(\pi_\theta, \pi_{\theta_{\text{init}}}).$$
Here, $\lambda > 0$ is a hyperparameter that adjusts the regularization strength.
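For KL regularization, this maximizer has a well-known closed form over an explicit outcome space: $\pi^*(s) \propto \pi_0(s)\exp(\sum_i n_i \hat r_i(s)/\lambda)$. The sketch below treats the "model" as a probability vector over a tiny sequence set; real RLHF training approximates this optimum rather than computing it exactly, and the instance is our own.

```python
import math

def sw_maximizing_policy(reward_models, sizes, pi0, lam):
    """argmax_pi  sum_i n_i * <r_i, pi>  -  lam * KL(pi || pi0),
    solved in closed form: pi*(s) ∝ pi0(s) * exp(sum_i n_i r_i(s) / lam)."""
    agg = {s: sum(n * r[s] for r, n in zip(reward_models, sizes)) for s in pi0}
    w = {s: pi0[s] * math.exp(agg[s] / lam) for s in pi0}
    z = sum(w.values())
    return {s: v / z for s, v in w.items()}

pi0 = {"s1": 0.5, "s2": 0.5}
r1, r2 = {"s1": 0.9, "s2": 0.1}, {"s1": 0.2, "s2": 0.8}
for lam in (0.5, 5.0):
    pi = sw_maximizing_policy([r1, r2], [3, 1], pi0, lam)
    print(lam, round(pi["s1"], 3))
```

Note the role of $\lambda$: the larger it is, the closer the trained policy stays to the initial model.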
Note that SW-Maximizing training rules constitute a set of training rules, one for each choice of $D$ and $\lambda$; we write $f^{SW}$ to indicate a member of this set. Furthermore, similar to condition (3) of Assumption 3.2, we also assume that for all reports and initial models, the fine-tuned model $\theta^*$ satisfies $\pi_{\theta^*}(s) > 0$ for all $s \in \mathcal{S}$. One simple way to achieve this is to set a large $\lambda$, so that the training result stays close enough to $\pi_{\theta_{\text{init}}}$.
We introduce the affine maximizer payment rule (Roberts, 1979), a weighted version of the VCG payment (Vickrey, 1961; Clarke, 1971; Groves, 1973):
$$p_i^{AMP}(\hat r, \hat n; \theta_{\text{init}}) = \max_{\theta} ASW_{-i}(\theta) - ASW_{-i}(\theta^*), \qquad \theta^* = f^{SW}(\hat r, \hat n; \theta_{\text{init}}).$$
The notations $ASW(\theta)$ and $ASW_{-i}(\theta)$ refer to the affine social welfare with and without group $i$ when the reported reward models are $\hat r$, the reported numbers of agents are $\hat n$, the initial model is $\theta_{\text{init}}$, and the model parameter is $\theta$. The affine social welfare consists of both the groups' valuations and the regularization term. Formally,
$$ASW_{-i}(\theta) = \sum_{j \ne i} \hat n_j \cdot \langle \hat r_j, \pi_\theta \rangle - \lambda \, D(\pi_\theta, \pi_{\theta_{\text{init}}}).$$
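Putting the pieces together, the affine maximizer payment charges each group the externality it imposes on the others. The sketch below instantiates it with the KL closed form over an explicit outcome space; the function names and the tiny instance are our own illustration, not the paper's implementation.

```python
import math

def policy(reward_models, sizes, pi0, lam):
    """Closed-form SW maximizer under KL regularization."""
    agg = {s: sum(n * r[s] for r, n in zip(reward_models, sizes)) for s in pi0}
    w = {s: pi0[s] * math.exp(agg[s] / lam) for s in pi0}
    z = sum(w.values())
    return {s: v / z for s, v in w.items()}

def affine_sw_without(i, reward_models, sizes, pi, pi0, lam):
    """ASW_{-i}(pi): weighted valuations of all groups except i,
    minus the regularization term lam * KL(pi || pi0)."""
    val = sum(n * sum(r[s] * pi[s] for s in pi)
              for j, (r, n) in enumerate(zip(reward_models, sizes)) if j != i)
    kl = sum(pi[s] * math.log(pi[s] / pi0[s]) for s in pi if pi[s] > 0)
    return val - lam * kl

def amp_payment(i, reward_models, sizes, pi0, lam):
    """p_i = max_pi ASW_{-i}(pi) - ASW_{-i}(pi*): the welfare the others
    lose because group i's report moved the training outcome."""
    pi_star = policy(reward_models, sizes, pi0, lam)
    others_r = [r for j, r in enumerate(reward_models) if j != i]
    others_n = [n for j, n in enumerate(sizes) if j != i]
    pi_without = policy(others_r, others_n, pi0, lam)
    return (affine_sw_without(i, reward_models, sizes, pi_without, pi0, lam)
            - affine_sw_without(i, reward_models, sizes, pi_star, pi0, lam))

pi0 = {"s1": 0.5, "s2": 0.5}
r1, r2 = {"s1": 0.8, "s2": 0.2}, {"s1": 0.1, "s2": 0.9}
print(round(amp_payment(0, [r1, r2], [2, 3], pi0, 1.0), 4))
```

The payment is always non-negative because the without-$i$ policy maximizes $ASW_{-i}$ by construction; adding it to group $i$'s utility makes that utility equal (up to a report-independent term) to the full affine social welfare, which is why truthful reporting is dominant.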
We show that $p^{AMP}$ implements SW-Maximizing training rules in both DSIC and IR, which implies that truthfully reporting both reward models and group sizes constitutes a dominant-strategy equilibrium in this mechanism.
For any SW-Maximizing training rule $f^{SW}$, the mechanism $(f^{SW}, p^{AMP})$ satisfies DSIC and IR.
Regarding payment equivalence, as mentioned in the previous section, the domain $\mathcal{R} \times \mathcal{N}$ is not connected in Euclidean space, so the results in Nisan et al. (2007) cannot be directly applied. However, we show that under the following assumption, SW-Maximizing training rules satisfy payment equivalence.
For any $\epsilon > 0$, there exists a $\delta > 0$ such that for any reports of the other groups, any group size, and any initial model, if two reported reward models of group $i$ are within distance $\delta$, then the corresponding training outcomes of $f^{SW}$ are within distance $\epsilon$.
This assumption is reasonable for most measures $D$ in SW-Maximizing training rules, as the space of reward models is continuous. The continuity ensures that when the reported information is sufficiently close, the training outcomes are also close. Specifically, we validate this assumption for some widely used distance measures.
Assumption 4.3 holds for SW-Maximizing training rules with commonly used regularizations, including the KL-divergence and standard norm-based distances.
Under this assumption, we derive the following result:
With the property of payment equivalence, we further investigate the revenue-maximizing payment rule that implements SW-Maximizing training rules in both DSIC and IR. Finding the revenue-maximizing multi-parameter mechanism is a challenging problem in classic mechanism design theory. However, since we have proved payment equivalence for SW-Maximizing training rules, we can utilize the necessary condition in Definition 3.6 to formulate it as an optimization problem. Solving this problem yields the optimal payment rule under the same conditions.
The relationship between the domains $\mathcal{R}$, $\mathcal{N}$ and this corollary is reflected in two aspects. First, the establishment of payment equivalence depends on assumptions about the choice of the domains, particularly that $\mathcal{R}$ includes all normalized reward models. Second, based on payment equivalence, finding the revenue-maximizing mechanism satisfying IR also requires information about the exact domains.
In this part, we discuss the influence of errors generated in practice on the incentive properties of the RLHF Game. We abstract it as an approximate valuation problem (Chiesa et al., 2012). Formally, when group $i$ reports its reward model $\hat r_i$, the mechanism may not use exactly $\hat r_i$ but rather a noisy reward model $\tilde r_i$ drawn from a conditional distribution given $\hat r_i$. We argue that this abstraction covers various error cases. One example is that computing the valuation defined in Definition 3.1 requires sampling sequences from an LLM, which may cause a deviation from the true valuation. Another example is that the fine-tuned model may not be exactly optimal for the reported reward models; this model can nevertheless be considered optimal for some deviated reward models.
We assume that agent groups are aware of the noise when their preferences are fed into the mechanism. Therefore, their utilities take it into account and have a different form. We use the capital letter $U_i$ to represent group $i$'s revised utility. Formally, for group $i$ with reward model $r_i$ and group size $n_i$, its utility for reporting $(\hat r_i, \hat n_i)$ is given by
$$U_i(\hat r_i, \hat n_i) = \mathbb{E}_{\tilde r_i}\Big[\, n_i \cdot v\big(r_i, f((\tilde r_i, \hat n_i), (\tilde r_{-i}, \hat n_{-i}); \theta_{\text{init}})\big) - p_i\big((\tilde r_i, \hat n_i), (\tilde r_{-i}, \hat n_{-i}); \theta_{\text{init}}\big) \Big].$$
Note that in defining $U_i$, we implicitly assume that each group cannot know the other groups' noise realizations. Therefore, the expectation is not taken with respect to $\tilde r_{-i}$.
We only consider the case in which the noisy input to the mechanism is close to the reported reward models:
For any profile of reported reward models $\hat r$, any profile of noisy reward models $\tilde r$ that can be generated from $\hat r$ with non-zero probability is within distance $\epsilon$ of $\hat r$.
We first show that, by directly applying the results in Section 4.1 to the noisy input, the loss in social welfare is upper bounded by a term proportional to $\epsilon$.
For an SW-Maximizing training rule, a group's utility in the mechanism $(f^{SW}, p^{AMP})$ consists of an affine social welfare term. Therefore, we can derive the following theorem based on Lemma 4.8.
This means that for any group, the maximum gain from misreporting is bounded by a term proportional to $\epsilon$, regardless of the others' reports. Agents will thus tend to report truthfully whenever finding the optimal misreport is costlier than this bound.
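The flavor of this robustness guarantee can be checked numerically: perturb every reported reward by at most $\epsilon$, train on the perturbed rewards, and compare the achieved regularized social welfare against the optimum. A telescoping argument bounds the loss by $2\epsilon \sum_i n_i$; the KL closed form, the instance, and the $\infty$-norm noise model below are our own choices.

```python
import math
import random

def policy(rs, ns, pi0, lam):
    """Closed-form SW maximizer under KL regularization."""
    w = {s: pi0[s] * math.exp(sum(n * r[s] for r, n in zip(rs, ns)) / lam)
         for s in pi0}
    z = sum(w.values())
    return {s: v / z for s, v in w.items()}

def objective(pi, rs, ns, pi0, lam):
    """Regularized social welfare: sum_i n_i <r_i, pi> - lam * KL(pi || pi0)."""
    val = sum(n * sum(r[s] * pi[s] for s in pi) for r, n in zip(rs, ns))
    kl = sum(pi[s] * math.log(pi[s] / pi0[s]) for s in pi)
    return val - lam * kl

rng = random.Random(0)
pi0 = {"s1": 0.4, "s2": 0.6}
rs = [{"s1": 0.7, "s2": 0.3}, {"s1": 0.2, "s2": 0.8}]
ns, lam, eps = [2, 3], 1.0, 0.05
noisy = [{s: r[s] + rng.uniform(-eps, eps) for s in r} for r in rs]

opt = objective(policy(rs, ns, pi0, lam), rs, ns, pi0, lam)      # true optimum
got = objective(policy(noisy, ns, pi0, lam), rs, ns, pi0, lam)   # trained on noise
print(0 <= opt - got <= 2 * eps * sum(ns))   # welfare loss is O(eps)
```

The bound follows because each of the two objectives differs from the other by at most $\epsilon \sum_i n_i$ at any fixed policy, and the noisy policy maximizes the noisy objective.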
Research involving multiple reward models primarily focuses on developing algorithms that enhance practical performance. Some studies design methods to simultaneously satisfy multiple preferences (Ramé et al., 2024; Wu et al., 2024; Jang et al., 2023; Park et al., 2024). Additionally, a body of work trains multiple models for a single preference and then ensembles them to improve the robustness of RLHF (Coste et al., 2023; Zhang et al., 2024), mitigate the influence of incorrect and ambiguous preferences in the dataset (Wang et al., 2024), and reduce reward hacking (Eisenstein et al., 2023). Unlike these approaches, our work considers how to truthfully elicit misaligned preferences from different agents.
Several studies have explored the properties relevant to our paper in various multi-parameter auction scenarios, such as implementability (Rochet, 1987; Miyake, 1998; Conitzer and Sandholm, 2004; Saks and Yu, 2005; Bikhchandani et al., 2006; Ashlagi et al., 2010) and payment equivalence (Ivanova-Stenzel and Salmon, 2008; Heydenreich et al., 2009; Bergemann and Välimäki, 2010; Pavan et al., 2014). Another central topic in auction theory is designing mechanisms that satisfy DSIC and IR while maximizing the auctioneer's expected revenue. Although the single-parameter scenario was resolved by Myerson (1981), optimal auction design in multi-parameter settings remains an open question. Therefore, a stream of research focuses on a specific subset, affine maximizer auctions, which inherently satisfy DSIC and IR (Roberts, 1979; Likhodedov and Sandholm, 2004; Sandholm and Likhodedov, 2015; Briest et al., 2010; Tang and Sandholm, 2012; Jehiel et al., 2007), and proposes optimizations to enhance empirical performance (Curry et al., 2022; Duan et al., 2024a,b). Compared to these works, we are the first to discuss payment equivalence and the revenue-maximizing solution in the scenario of fine-tuning LLMs.
Other works have also explored the intersection of game theory and large language models. Some research proposes algorithms for training LLMs inspired by game-theoretic concepts, such as Nash learning from human feedback (Munos et al., 2023), the consensus game (Jacob et al., 2023), direct Nash optimization (Rosset et al., 2024), and the approach of Gemp et al. (2024). Furthermore, various studies assess LLMs from a game-theoretic perspective, examining aspects such as rationality (Chen et al., 2023; Fan et al., 2023), behavior in matrix games (Akata et al., 2023; Gandhi et al., 2023; Lorè and Heydari, 2023), and performance in strategic games like auctions (Guo et al., 2023, 2024), Werewolf (Xu et al., 2023a,b), and Avalon (Wang et al., 2023a).
In the RLHF Game with $n$ groups, calculating $p^{AMP}$ requires separate complete training processes for the different optimal models with and without each group. This can be inefficient due to costly training. To address this problem, we propose two modifications to $p^{AMP}$. Both involve computing an approximate $\theta^*_{-i}$ instead of the true optimum when calculating payments:
Calculate an approximate $\theta^*_{-i}$ starting from the parameters saved in the process of training $\theta^*$.
Adopt fewer iterations in the training process for calculating $\theta^*_{-i}$, and thus accept a result that is not exactly optimal.
The first method needs only one complete training process (for $\theta^*$) but affects the DSIC property, since the saved parameters are also influenced by group $i$'s report. In comparison, the second approach incurs higher training costs but guarantees strict DSIC.
This paper investigates incentive design in fine-tuning large language models using multiple reward models. We formalize this scenario as the RLHF Game, where a service provider proposes training and payment rules, and agents strategically report their preferences. We demonstrate the necessity of payment schemes for incentivizing truthful reporting in general training rules and provide a comprehensive characterization of payment schemes that implement SW-Maximizing training rules in dominant strategies. These findings enhance the theoretical understanding of mechanism design in LLM fine-tuning and offer guidelines for implementing effective RLHF-based systems in various contexts.
Future research in this field presents several promising directions. Firstly, investigating mechanisms integrating efficiency and incentive compatibility within the RLHF Game could significantly enhance its applicability in real-world scenarios. Secondly, modeling and examining more complex training rules, such as dynamic training rules, could deepen the understanding of this framework. Thirdly, designing mechanisms for more general cases that aggregate preferences into multiple models based on diversity considerations is crucial. Additionally, applying mechanism design theory to other scenarios related to large language models, such as API charge schemes, retrieval-augmented generation (RAG), and prompt engineering, offers valuable opportunities for further exploration.
See Theorem 3.3.
For the case that all group sizes equal one, the optimization of the objective $F$ can be written as a programming problem:
Because of condition (3) of Assumption 3.2, we can infer that the non-negativity constraints on the output probabilities are not active at the optimal solution. Further, by condition (1) of Assumption 3.2, all the partial derivatives exist. Since the optimal solution is also a local extremum, a necessary condition for optimality is that there exist Lagrange multipliers (Luenberger et al., 1984) such that
Under Assumption 3.1, we have
| (1) |
When the real reward model for the agent is $r_i$, we construct a reported reward model $r_i'$ and show that reporting it yields strictly higher utility. We denote by $S^+$ the set of sequences with non-zero reward value and take the sequence in $S^+$ with the highest reward. Then we take a small $\epsilon > 0$ and define $r_i'$ as:
Intuitively, $r_i'$ assigns more value to the element with the highest value and less to the element with the lowest non-zero value. Letting $\theta^*$ and $\theta^{*\prime}$ denote the training outcomes under $r_i$ and $r_i'$, we analyze the variation of the corresponding policies $\pi_{\theta^*}$ and $\pi_{\theta^{*\prime}}$. We use unprimed and primed variables to denote the quantities in the necessary optimality conditions for $\theta^*$ and $\theta^{*\prime}$, respectively, and derive the following results.
(a) and . We prove the former by contradiction: if , then by the assumption of , we have
With , and , we can infer that . However, since for all , we have
to satisfy the optimality condition in (1), we must have, for all ,
which is equivalent to , and hence results in . The latter claim, , can be proved by exactly the same method.
(b) The order of and for all is consistent. Without loss of generality, we assume there is such that . Then we have
Since , we can infer that . Then for all , to satisfy (1), we must have
which is equivalent to . Similarly, if there is such that , then for all , we have .
Finally, combining the results in (a) and (b): when for all , we have
When for all , we have
Note that both (2) and (3) follow from . Moreover, unless , which means that all have the same reward value , the inequalities hold strictly. ∎
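As a numerical sanity check (ours, not part of the original proof), the sketch below instantiates a KL-regularized training rule on a finite output set, using the standard closed form π*(y) ∝ π₀(y)·exp(r(y)/β). All concrete values (`pi0`, `r_true`, `beta`, the exaggeration factor) are illustrative assumptions; the sketch shows that, absent payments, an agent strictly increases its true valuation by exaggerating its reported rewards.

```python
import math

def kl_optimal_policy(pi0, r, beta):
    """Closed-form optimum of E_pi[r] - beta * KL(pi || pi0) on a finite set."""
    w = [p * math.exp(ri / beta) for p, ri in zip(pi0, r)]
    z = sum(w)
    return [x / z for x in w]

def true_valuation(pi, r_true):
    return sum(p * ri for p, ri in zip(pi, r_true))

pi0 = [0.5, 0.3, 0.2]       # initial model (illustrative)
r_true = [1.0, 0.5, 0.0]    # agent's true reward model (illustrative)
beta = 1.0

v_truth = true_valuation(kl_optimal_policy(pi0, r_true, beta), r_true)

# Exaggerated report: scaling up the rewards pushes the regularized
# optimum harder toward the agent's favourite outputs.
r_exag = [3.0 * ri for ri in r_true]
v_exag = true_valuation(kl_optimal_policy(pi0, r_exag, beta), r_true)

assert v_exag > v_truth     # misreporting strictly helps without payments
```

The strict gain from exaggeration is exactly the incentive problem the payment scheme is designed to remove.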
See 3.5
We first prove the necessity: if is implementable, it satisfies cycle monotonicity. Since is implementable, there exists such that satisfies DSIC. Then for any , , and any finite and distinct sequence of reward models , , we let and . By the property of DSIC, we have
By definition of the function , this is equivalent to
Summing over all , we get
This means that satisfies cycle monotonicity.
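To make the cycle-monotonicity condition concrete, here is a small illustrative sketch (a toy single-agent stand-in for an SW-Maximizing rule, with made-up valuations `v` and a maximizing rule `psi`): every cycle over the type space must have non-negative total weight.

```python
from itertools import permutations

# Toy valuations v[type][model]; psi is a maximizing training rule.
v = {
    "a": {"m1": 3.0, "m2": 1.0},
    "b": {"m1": 1.0, "m2": 2.0},
    "c": {"m1": 0.5, "m2": 0.5},
}

def psi(report):
    return max(v[report], key=v[report].get)

def l(r, r_mis):
    # valuation change from misreporting r_mis to truthfully reporting r
    return v[r][psi(r)] - v[r][psi(r_mis)]

def cycle_weight(cycle):
    return sum(l(cycle[k], cycle[k - 1]) for k in range(len(cycle)))

# Cycle monotonicity: every cycle over the type set has non-negative weight.
assert all(cycle_weight(list(p)) >= 0.0 for p in permutations(v))
```

Here each edge weight is individually non-negative because `psi` maximizes the reporter's own valuation, so every cycle trivially satisfies the condition; for a general implementable rule only the cycle sums need be non-negative.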
Then we prove the sufficiency: if satisfies cycle monotonicity, then it is implementable. For any two , we use the function to denote the shortest length, measured by , over all sequences from to . Formally,
By cycle monotonicity, we have that for any finite and distinct sequence ,
By the arbitrariness of the sequence, we can infer that
Since is bounded, is also finite and . Then we can establish a payment rule such that for any agent ,
where is defined as follows.
(1)
Then, for any , , and , we have
Note that (2) comes from the definition of , which gives:
This means that the mechanism satisfies DSIC, which suffices to show that is implementable. ∎
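The shortest-path payment construction above can be checked on a toy instance. In the sketch below (the type set, valuations, and reference type `r0` are our illustrative assumptions), `dist` is computed by brute-force path enumeration and the induced payment rule is verified to be DSIC.

```python
from itertools import permutations

v = {
    "a": {"m1": 3.0, "m2": 1.0},
    "b": {"m1": 1.0, "m2": 2.0},
    "c": {"m1": 0.5, "m2": 0.5},
}

def psi(report):
    return max(v[report], key=v[report].get)

def l(r, r_mis):
    return v[r][psi(r)] - v[r][psi(r_mis)]

def dist(src, dst):
    """Shortest total weight over all finite, distinct paths src -> dst."""
    if src == dst:
        return 0.0
    others = [t for t in v if t not in (src, dst)]
    best = float("inf")
    for k in range(len(others) + 1):
        for mid in permutations(others, k):
            path = [src, *mid, dst]
            w = sum(l(path[i + 1], path[i]) for i in range(len(path) - 1))
            best = min(best, w)
    return best

r0 = "a"
payment = {t: dist(r0, t) for t in v}   # payment rule from the construction

# DSIC check: truthful reporting maximizes utility for every true type.
for true in v:
    u_truth = v[true][psi(true)] - payment[true]
    assert all(u_truth >= v[true][psi(rep)] - payment[rep] - 1e-12
               for rep in v)
```

Cycle monotonicity guarantees that `dist` is finite and satisfies the triangle inequality, which is exactly what the DSIC check exercises.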
See 3.7
We follow Theorem 1.37 in Nisan et al. [ 2007 ] .
Assume that the are connected sets in Euclidean space; then all implementable training rules satisfy payment equivalence.
In our paper we assume that, for all , is the set of all non-negative and normalized -dimensional vectors, which is a connected set under the usual Euclidean metric. Therefore, the theorem holds. ∎
See 4.2
We assume that for group , the true reward model is and the agent number is . The reports of other groups are and the initial model is .
(1) satisfies DSIC.
We compare the utility of reporting with that of any other . For convenience, we first simplify notation by letting
The valuation of group is the per-agent valuation multiplied by the true number of agents:
According to the payment rule , the payments for and for are
Therefore, we can calculate the change in the utility:
The last inequality holds by the definition of
Therefore, we can conclude that, for all , and any possible , , we have
(2) satisfies IR.
We reuse the notation above and denote by the optimal parameter for all groups except , i.e. . When group truthfully reports its reward model and agent number , its utility can be written as:
Therefore, we can conclude that, for all , , we have
∎
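A discrete analogue of this argument can be verified numerically. The sketch below (a toy instance with a finite model set and a hypothetical report grid, not the paper's continuous setting) pairs a social-welfare-maximizing rule with a Clarke-style externality payment, a special case of the affine maximizer family, and brute-forces the DSIC and IR conditions.

```python
from itertools import product

MODELS = ["m1", "m2", "m3"]

def sw_rule(reports):
    """Pick the model maximizing reported social welfare."""
    return max(MODELS, key=lambda m: sum(rep[m] for rep in reports))

def clarke_payment(reports, i):
    """Externality that group i imposes on the other groups."""
    others = reports[:i] + reports[i + 1:]
    chosen = sw_rule(reports)
    best_without = max(sum(rep[m] for rep in others) for m in MODELS)
    return best_without - sum(rep[chosen] for rep in others)

# A small grid of possible reports (valuations over MODELS).
REPORTS = [dict(zip(MODELS, vals))
           for vals in product([0.0, 1.0, 2.0], repeat=len(MODELS))]

true_i = {"m1": 3.0, "m2": 0.0, "m3": 1.0}    # group i's true valuation
others = [{"m1": 0.0, "m2": 2.0, "m3": 1.0}]  # fixed reports of the others

def utility(report_i):
    reports = [report_i] + others
    m = sw_rule(reports)
    return true_i[m] - clarke_payment(reports, 0)

u_truth = utility(true_i)
assert u_truth >= 0.0                                        # IR
assert all(u_truth >= utility(r) - 1e-12 for r in REPORTS)   # DSIC
```

The payment charges group i only for how its report shifts the chosen model away from the others' optimum, which is why truthful reporting is both dominant and individually rational here.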
See 4.4
(1) For (KL-divergence), since is a finite set, we can rewrite the training rule as an optimization problem as follows:
We have assumed that the optimal point is unique and that the optimal model satisfies for all . The necessary condition for an optimal is that there exists such that
Similarly, for the input , there exists , such that the optimal satisfies
For convenience, we define . Then the relationship between and is given by
Note that we also have the condition
Since , we can infer that
This is equivalent to
Thus, the difference between and can be bounded by
For any , when we set , we have
(2) For ( distance), since is a finite set, we can rewrite the training rule as an optimization problem as follows:
We have assumed that the optimal point is unique and that the optimal model satisfies for all . The necessary condition for an optimal is that there exists such that
Similarly, for the input , there exists , such that the optimal satisfies
For convenience, we define . Then the relationship between and is given by
Note that we also have the condition
Since , we can infer that
This is equivalent to
Thus, the difference between and can be bounded by
For any , when we set , we have
∎
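As an informal check of this robustness bound (illustrative only; the formal argument proceeds through the first-order conditions above), the sketch below uses the closed form of the KL-regularized optimum on a finite set, π*(y) ∝ π₀(y)·exp(r(y)/β), and verifies that a bias of size ε in the reported reward shifts the optimal policy by at most ε/β in total variation. The numbers `pi0`, `r`, and `beta` are assumed for the example.

```python
import math

def kl_optimal_policy(pi0, r, beta):
    """Closed-form optimum of the KL-regularized objective on a finite set."""
    w = [p * math.exp(ri / beta) for p, ri in zip(pi0, r)]
    z = sum(w)
    return [x / z for x in w]

def tv(p, q):
    """Total-variation distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

pi0 = [0.4, 0.35, 0.25]
r = [1.0, 0.2, 0.0]
beta = 1.0

for eps in [1e-1, 1e-2, 1e-3]:
    r_biased = [r[0] + eps] + r[1:]        # perturb one reward entry
    d = tv(kl_optimal_policy(pi0, r, beta),
           kl_optimal_policy(pi0, r_biased, beta))
    assert d <= eps / beta                 # policy shift is O(eps / beta)
```

The Lipschitz-style dependence of the optimum on the reported rewards is what turns a bounded reporting bias into the approximate-DSIC guarantee.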
See 4.5
We prove an equivalent version of payment equivalence: for any group , when the other groups' reports and are fixed, any two payment rules and that implement in DSIC must satisfy that there exists a constant such that for any and . Therefore, in the rest of the proof, we regard and as fixed and omit them from the notation.
We first redefine the functions and introduced in the previous section. The functions and are defined on the type space of the group. For the simplified case in Section 3 , the type is exactly the reward model . But when we consider both and , the type must contain both pieces of information, so we use to represent the combination . The domain of is . Note that, unless otherwise stated, represents the and with the same superscript and subscript, for example, .
Similar to the simplified version discussed in Section 3 , we let be the change in valuation from misreporting type to truthfully reporting type . Formally,
And refers to the smallest value of over all finite and distinct paths from to
We first prove the following lemma, which is a special case of a result in Heydenreich et al. [ 2009 ] .
An implementable training rule satisfies payment equivalence if for any agent and any types , , we have
Assume there is a mechanism that satisfies DSIC. For any two types , and a finite and distinct sequence , let and . We have that
This can be rewritten as
Summing over , we get the following inequality:
Since this holds for arbitrary finite and distinct sequences, we can infer that . Similarly, . Combining these results with , we obtain
which means that . Note that this holds for arbitrary and . Therefore, once the payment is determined for some , the payments for all other s are determined. For example, if two payment rules and both implement in DSIC, and we denote the payments when reports the uniform reward model defined in Equation 1 and by and respectively, then
Note that and are not influenced by ’s report, but they may vary with , and , which means that we can consider the term as a function of . ∎
Then we show that, under 4.3 , the SW-Maximizing training rule satisfies the condition stated in Lemma B.1 . Firstly, we show that for any , we have . By the definition of the function , and refer to the shortest paths from to and from to , respectively, which means that is the shortest weight of a cycle that goes through and . Since the SW-Maximizing training rule is implementable, by Theorem 3.5 we know that the weight of any cycle is non-negative. Therefore, must hold.
Then we show that for any and , . We prove this by constructing a finite and distinct sequence such that
(2)
This suffices for , since and are lower bounds for and , respectively.
First, we rewrite the LHS of Equation 2 using the definition of the function .
In the above equations, for .
By 4.3 , when , and are fixed, there exists such that if , then , where and .
We construct the sequence as follows: we set , and let . For each ,
And for each ,
In this construction, each is a weighted average either of and , or of and . This ensures that all reward models in the sequence are valid (normalized and non-negative). We can then divide the above equation into three parts, making the the same in the first and last parts.
(a)
(b)
(c)
We first show that (b) equals by proving . By contradiction, if , then by the uniqueness of the optimal point we have that
Note that for all , we can calculate that . Thus, the above equation can be rewritten as:
This contradicts the optimality of . Therefore, and must be identical, which means that (b) equals .
Then we turn to (a). By the construction, for any and , , so that holds for all . Then we can derive that:
The case of (c) is similar. By the construction, for any and , , so holds for all . Then we can derive that:
Combining the results from (a), (b), and (c), we have that under this construction,
By the arbitrariness of , this suffices to demonstrate that .
Therefore, it is proven that
which means that . By Lemma B.1 , this is a sufficient condition for the payment equivalence of . ∎
See 4.6
Given the payment equivalence of , and since satisfies DSIC, we can formulate the problem of finding the revenue-maximizing DSIC and IR payment rule as a programming problem. By symmetry, we only consider the payment for agent here.
s.t.
The solution of this program is trivially given by
Therefore, the revenue-maximizing payment is
∎
For any , if , then for any model , we have
We can derive that
∎
See 4.8
See 4.9
Recall that the calculation of payment in is
Letting , the utility function can be written as:
where we define , and . Note that the term is not influenced by the change of or .
Therefore, we can derive that:
All the in the above inequalities refer to the optimal parameter for input , i.e. . Specifically, (1) and (3) come from the bounded distance between and ( Lemma B.2 ). (2) and (5) hold by the definitions: and . And (4) holds since the inner term is irrelevant to .
Therefore, we get
∎